The modal social science project starts by importing existing datasets. Datasets come in all shapes and sizes. As you search for new data you may encounter dozens of file extensions – csv, xlsx, dta, sav, por, Rdata, Rds, txt, xml, json, shp … the list continues. Although these files can often be cumbersome, it’s good to be able to handle any file that your research may call for.

Reviewing data import will allow us to get on the same page on how computer systems work.

Tutorials

Sometimes a third-party primer is more effective than a once-a-week section or office hours. We can provide guidance on which primers are most effective for this course, and for R I recommend: https://rstudio.cloud/learn/primers

Orienting

  1. We will be using a cloud version of RStudio at https://rstudio.cloud. You should join the Math Prefresher Space 2018 from the link that was emailed to you. Each day, click on the project with the day’s date on it.

    Although most of you will probably be doing your work on a local installation of RStudio rather than the cloud version, we are trying to use the cloud because it makes it easier to standardize people’s settings.

  2. RStudio (either cloud or desktop) is a GUI and an IDE for the programming language R. A Graphical User Interface allows users to interface with the software (in this case R) using graphical aids like buttons and tabs. Often we don’t think about GUIs because to most computer users, everything is a GUI (like Microsoft Word or your “Control Panel”), but they are always there!

    An Integrated Development Environment just means that the software used to interface with R comes with useful bells and whistles that give you shortcuts.

    The Console is the core window through which you see your GUI actually operating through R. It’s not graphical, so it might not be as intuitive. But all your results, commands, errors, and warnings appear here. A console tells you what’s going on now.

A Typical RStudio Window at Startup

  3. Theoretically, one could do all their work in a Console. But that would be a lot of work, because you’d have to give instructions each time you start your data analysis. Moreover, you’ll have no record of what you did. That’s why you need a script. This is a type of code. It can be referred to as a source because that is the source of your commands. Source is also used as a verb; “source the script” just means execute it.

    RStudio doesn’t start out with a script, so you can make one from “File > New” or the New file icon.

Opening New Script (as opposed to the Console)

  4. You can also open scripts that are in your folder. A script is a type of File. Find your Files in the bottom-right “Files” pane.

Opening an Existing Script from Files

  5. In R, there are two main types of scripts: a classic .R file and a .Rmd file (for Rmarkdown). A .R file is just lines and lines of R code that are meant to be run in the Console. A .Rmd file weaves code and English together, to make it easier for users to create reports that interact with data and intersperse R code with explanation. For example, we built this book in Rmds.

What Rmarkdown facilitates is the use of code chunks, which are used here. These start and end with three back-ticks. At the beginning of a chunk, we can add options in curly braces ({}). Specifying r at the beginning tells Rmarkdown to render the chunk as R code. Options like echo = TRUE control whether the code that was executed is shown; eval = TRUE controls whether the code is evaluated. More about Rmarkdown in Section @ref(nonwysiwyg).

For example:

A code chunk in Rmarkdown (before rendering)

This code chunk would evaluate 1 + 1 and show its output when compiled, but not display the code that was executed.

Where is your File?

Computer files (data, documents, programs) are organized hierarchically, like a branching tree. Folders can contain files, and also other folders. To load a dataset, you need to specify where that file is.

We denote the hierarchy of a folder by the / (slash) symbol. data/input/2018-08 indicates the 2018-08 folder, which is included in the input folder, which is in turn included in the data folder.
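As a small illustration of this notation, base R can build a path and take it apart again. The sketch below only manipulates text, so it runs even if these folders do not exist on your computer.

```r
# Build a path from its pieces; file.path() joins them with "/".
path <- file.path("data", "input", "2018-08")
path

# dirname() gives the containing folder; basename() gives the last piece.
dirname(path)
basename(path)
```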

Files (but not folders) have “file extensions”, which you are probably familiar with already: .docx and .pdf, for example. The file extensions you will see most often in this course include .xlsx, .csv, and .dta, each of which is read in below.
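If you are ever unsure what kind of file you have, R can pull the extension out of a file name. This is a minimal sketch that uses the file names from the examples in this section together with the file_ext() function from the tools package that ships with R.

```r
# File names taken from the examples in this section.
files <- c("HHP_August2018_data_csv.xlsx", "anscombe_long.csv", "nlsw88.dta")

# tools::file_ext() returns the text after the last dot in each name.
extensions <- tools::file_ext(files)
extensions
```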

A typical file format is Microsoft Excel. Although this is not usually the best format for R because of its highly formatted structure as opposed to plain text (more on this in Section (sec:wysiwyg)), recent packages have made reading Excel files fairly easy.

The first time you use an outside package, you need to install it.

install.packages("readxl")

After that, you don’t need to install it again. But you do need to load it.

library(readxl)
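The install-once, load-every-session pattern can be wrapped up in a few lines. The helper below, ensure_package(), is a hypothetical name introduced only for this sketch; it is not part of base R or readxl. It is demonstrated on stats, which ships with R, so that running it does not download anything.

```r
# Hypothetical helper (an assumption for illustration): install a package
# only if it is missing, then attach it for the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # needed once per computer
  }
  library(pkg, character.only = TRUE)  # needed once per session
}

# "stats" ships with R, so this call attaches it without installing anything.
ensure_package("stats")
```

Calling ensure_package("readxl") would follow the same two steps for the package used in this section.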

The package readxl has a website: https://readxl.tidyverse.org/. Other packages are not as user-friendly, but they have a help page with a table of contents of all their functions.

help(package = readxl)

Reading in Data

From the help page, we see that read_excel() is the function that we want to use. Look at the help page. How do you read a help page?
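One way to get oriented is to compare the Usage section of the help page with what R itself reports about the function. The sketch below assumes readxl is already installed; args() and formals() are base R functions.

```r
library(readxl)

# Print the function signature: argument names and their default values.
args(read_excel)

# The argument names alone, in the order they appear in the signature.
arg_names <- names(formals(read_excel))
arg_names
```

Arguments without a default value, such as the first one, are the ones you must supply yourself.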

Let’s try it.

poll <- read_excel("data/input/HHP_August2018_data_csv.xlsx")

What does the / mean? Why do we need the input term first? Does the argument need to be in quotes?
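One way to explore these questions from the Console is to ask R where it is currently looking and whether it can see the file. This is a sketch using base R only; the result of file.exists() depends on your own project folder.

```r
# The working directory is the folder that relative paths start from.
getwd()

# The path is a character string, so it is wrapped in quotes.
path <- "data/input/HHP_August2018_data_csv.xlsx"

# TRUE if R can find the file from the working directory, FALSE otherwise.
found <- file.exists(path)
found
```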

Or csv

anscombe <- read_csv("data/input/anscombe_long.csv")

This call fails if the package that provides read_csv() has not been loaded. You need to load a package before you can use its functions.

library(tidyverse)
anscombe <- read_csv("data/input/anscombe_long.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   dataset = col_integer(),
##   x = col_integer(),
##   y = col_double()
## )
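The message above reports the column types that read_csv() guessed. If you would rather state them yourself, the col_types argument accepts the same kind of cols() specification. The sketch below assumes the readr package is installed and reads a tiny made-up file written to a temporary location, not the course data.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,dataset,x,y", "1,1,10,8.04", "2,1,8,6.95"), tmp)

# State the column types explicitly instead of letting read_csv() guess them.
toy <- read_csv(
  tmp,
  col_types = cols(
    id = col_integer(),
    dataset = col_integer(),
    x = col_integer(),
    y = col_double()
  )
)
toy
```

Stating the types up front means a surprise in the file, such as text in a numeric column, is flagged as a parsing problem rather than silently changing the column type.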

Or dta

library(haven)
nlsw88 <- read_dta("data/input/nlsw88.dta")